Monte Carlo Methods for Prediction & Control

By Rohit Pardasani

What are Monte Carlo methods?

Monte Carlo methods, or Monte Carlo experiments, are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results.
The underlying concept is to use randomness to solve problems that might be deterministic in principle.
They are often used in physical and mathematical problems and are most useful when it is difficult or impossible to use other approaches.

Steps

Monte Carlo methods vary, but tend to follow a particular pattern:

  1. Define a domain of possible inputs
  2. Generate inputs randomly from a probability distribution over the domain
  3. Perform a deterministic computation on the inputs
  4. Aggregate the results
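The four steps above can be sketched as a generic sampling loop. This is a minimal illustration, not a standard API; the function and parameter names are made up for the example. It estimates the mean of a deterministic function over randomly drawn inputs:

```python
import random

def monte_carlo_mean(f, sampler, n_samples, seed=0):
    """Estimate E[f(X)] by the four Monte Carlo steps:
    the sampler defines the input domain (step 1) and draws random
    inputs from it (step 2), f is the deterministic computation
    (step 3), and averaging aggregates the results (step 4)."""
    random.seed(seed)
    total = 0.0
    for _ in range(n_samples):
        total += f(sampler())  # deterministic computation on a random input
    return total / n_samples

# Example: E[X^2] for X ~ Uniform(0, 1) is exactly 1/3,
# so the estimate should be close to 1/3.
estimate = monte_carlo_mean(lambda x: x * x, random.random, 100_000)
```

With 100,000 samples the standard error of this estimate is under 0.001, so the result lands very close to 1/3.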

Example (Monte Carlo method applied to approximating the value of π)

Consider a quadrant (circular sector) inscribed in a unit square. Given that the ratio of their areas is π/4, the value of π can be approximated using a Monte Carlo method:

  1. Draw a square, then inscribe a quadrant within it
  2. Uniformly scatter a given number of points over the square
  3. Count the number of points inside the quadrant, i.e. having a distance from the origin of less than 1
  4. The ratio of the inside count to the total sample count is an estimate of the ratio of the two areas, π/4. Multiply the result by 4 to estimate π.
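A minimal sketch of this recipe in Python (the function name is illustrative):

```python
import random

def estimate_pi(n_samples, seed=0):
    """Approximate π by uniformly scattering points over the unit
    square and counting those that fall inside the quadrant."""
    random.seed(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = random.random(), random.random()  # uniform point in the unit square
        if x * x + y * y <= 1.0:                 # within distance 1 of the origin
            inside += 1
    return 4 * inside / n_samples                # inside/total ≈ π/4
```

The error shrinks roughly as 1/√n, so with 100,000 samples the estimate is typically within a few hundredths of π.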

Why use Monte Carlo methods instead of Dynamic Programming?

In the above case, the exact value of the average is 42!

Using Monte Carlo for prediction?

Using Monte Carlo for Action Values

Using Monte Carlo for Generalized Policy Iteration

Solving Blackjack

We train the agent to play Blackjack using Monte Carlo with exploring starts.
This means each episode should begin in a random state with a random action, after which the current policy is followed.
Blackjack naturally starts in a random state (the initial deal), so the agent's first action is a random hit or stick, and from then on the policy described above is followed.
Below is the estimate from the first episode. We continue to refine the estimates by iterating through many such episodes, and finally obtain the graph shown.
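The scheme described above can be sketched as follows for a simplified Blackjack (infinite deck, dealer hits below 17, player decisions only for sums 12–21). This is an illustrative sketch under those assumptions, not the exact environment used in the notebook; all function names here are made up for the example:

```python
import random
from collections import defaultdict

def draw_card():
    # Infinite deck: 2-9 at face value, 10/J/Q/K as 10, ace as 1.
    return min(random.randint(1, 13), 10)

def hand_value(cards):
    # Return (best sum, usable_ace): count one ace as 11 if it doesn't bust.
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False

def play_episode(policy):
    # Natural random start: deal, then draw until the sum reaches 12
    # (below 12, hitting can never bust, so there is no real decision).
    player = [draw_card(), draw_card()]
    dealer_show = draw_card()
    while hand_value(player)[0] < 12:
        player.append(draw_card())

    trajectory, first = [], True
    while True:
        psum, usable = hand_value(player)
        state = (psum, dealer_show, usable)
        if first:  # exploring start: the first action is random hit or stick
            action, first = random.choice([0, 1]), False
        else:      # afterwards, follow the current greedy policy
            action = policy.get(state, 1)
        trajectory.append((state, action))
        if action == 0:  # stick
            break
        player.append(draw_card())
        if hand_value(player)[0] > 21:
            return trajectory, -1  # player busts

    # Dealer plays a fixed rule: hit until reaching 17.
    dealer = [dealer_show, draw_card()]
    while hand_value(dealer)[0] < 17:
        dealer.append(draw_card())
    dsum, psum = hand_value(dealer)[0], hand_value(player)[0]
    if dsum > 21 or psum > dsum:
        return trajectory, 1
    return trajectory, 0 if psum == dsum else -1

def mc_exploring_starts(episodes=20_000, seed=0):
    random.seed(seed)
    Q = defaultdict(float)      # action-value estimates
    counts = defaultdict(int)   # visit counts for incremental averaging
    policy = {}                 # greedy policy, improved as Q is refined
    for _ in range(episodes):
        trajectory, ret = play_episode(policy)
        for state, action in trajectory:
            counts[(state, action)] += 1
            Q[(state, action)] += (ret - Q[(state, action)]) / counts[(state, action)]
            # Policy improvement: act greedily w.r.t. the current estimates.
            policy[state] = 0 if Q[(state, 0)] >= Q[(state, 1)] else 1
    return Q, policy
```

Each episode refines the action-value estimates for the states it visits, and the greedy improvement step gradually shapes the policy toward the familiar stick/hit threshold surface.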

Epsilon-soft policies

Why does off-policy learning matter?

Importance Sampling

Off-Policy Monte Carlo Prediction